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Spec' 


A  class  of  probability  density  estimates  can  be  obtained  by  penalizing 


the  likelihood  by  a  functional  which  depends  on  the  roughness  of  the  logarithm 
of  the  density.  The  limiting  case  of  the  estimates  as  the  amount  of  smoothing 
increasing  has  a  natural  form  which  makes  the  method  attractive  for  data 


analysis  and  which  provides  a  rationale  for  a  particular  choice  of  roughness 
penalty.  The  estimates  are  shown  to  be  the  solution  of  an  unconstrained 
convex  optimization  problem,  and  mild  natural  conditions  are  given  for  them  to 
exist.  Rates  of  consistency  in  various  norms  and  conditions  for  asymptotic 
normality  and  approximation  by  a  Gaussian  process  are  given,  thus  breaking  new 
ground  in  the  theory  of  maximum  penalized  likelihood  density  estimation. 
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SIGNIFICANCE  AND  EXPLANATION 


The  basic  problem  considered  is  the  estimation  of  a  probability  density 
function  f  given  observations  from  the  density.  This  problem  arises 
virtually  wherever  data  are  collected,  and  is  of  particular  interest  in 
medical  and  engineering  applications.  Density  estimates  are  useful  for 
exploring  properties  of  the  data  and  for  presenting  data  in  a  way  comprehen¬ 
sible  to  the  layman.  They  are  also  used  for  constructing  versions  of  various 
statistical  techniques  ( for  example  in  automatic  diagnosis  and  pattern  recog¬ 
nition)  which  do  not  depend  on  specific  assumptions  about  the  underlying  model. 

A  particular  case  of  the  estimates  of  the  present  paper  is  the 
following.  Given  data  X^,...,X  ,  choose  the  estimate  f  to  maximize 

n  £  log  f (X , )  -  —  A  /  {(d/dx)  log  f(x)}  dx,  subject  to  /  f  *  1.  The 
i=1  x 

first  term  is,  in  a  sense,  the  goodness  of  fit  of  f  to  the  data,  while  the 
integral  is  a  penalty  for  how  'wiggly'  the  estimate  is.  The  parameter  X 
controls  the  amount  by  which  the  data  are  smoothed  to  obtain  the  estimate.  As 
the  parameter  A  tends  to  infinity,  the  estimate  converges  to  a  normal 
density  fitted  to  the  data,  and  thus  the  definition  of  an  'infinitely 
smoothed'  estimate  is  very  natural;  that  is  the  advantage  of  the  formulation 
of  this  paper  over  previous  methods. 

Because  of  their  implicit  definition,  density  estimates  obtained  in  this 
way  are  rather  intractable  and  little  is  known  about  their  behavior.  In  this 
paper  several  new  results  on  the  large  sample  behavior  of  the  estimates  are 
obtained  and  so  the  work  should  help  considerable  in  the  understanding  of 
roughness  penalty  procedures.  In  addition  a  characterization  of  the  estimates 
as  the  solution  to  an  unconstrained  convex  optimization  procedure  is  given; 
apart  from  its  mathematical  value,  this  should  be  very  useful  when  computing 
the  estimates  in  practice. 


The  responsibility  for  the  wording  and  views  expressed  in  this  descriptive 
summary  lies  with  MRC,  and  not  with  the  author  of  this  report. 


ON  THE  ESTIMATION  OF  A  PROBABILITY  DENSITY  FUNCTION 
BY  THE  MAXIMUM  PENALIZED  LIKELIHOOD  METHOD 

B.  W.  Silverman* 

1.  INTRODUCTION. 

Good  and  Gaskins  (1971)  introduced  the  idea  of  roughness 
penalty  density  estimation.  Their  idea  was  to  use  as  an  estimate 
that  density  which  maximized  a  penalized  version  of  the 
likelihood.  Given  observations  X^...fXn<  the  penalized  log 
likelihood  is  defined  as 

to(f)  -  l  log  f(X1)  -  ctR(f) 

r  2 

where  R(f)  is  a  1 f lamboyancy  functional*  such  as  J  (f")  and 
the  parameter  a  controls  the  amount  by  which  the  data  are 
smoothed  to  give  the  estimate.  (Use  the  convention  throughout  that 
unqualified  sums  are  over  the  range  i  =*  1  to  n.)  Without  the 
roughness  penalty  term  the  likelihood  is  unbounded  above; 
intuitively  the  maximum  likelihood  estimator  is  a  sum  of  delta 
function  spikes  at  the  observations.  The  Good-Gaskins  formulation 
can  be  given  a  Bayesian  justification;  see  their  paper  for 
details.  An  excellent  exposition  of  penalized  likelihood  estimates 
is  given  by  Tapia  and  Thompson  (1978). 
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In  this  paper  a  variation  of  the  Good-Ga skins  estimator  is 
discussed.  For  compelling  reasons  given  in  Section  2  below,  the 
logarithm  of  the  density  -  rather  than  the  density  itself  -  will  be 
penalized  for  roughness.  In  Section  3  it  will  be  seen  that  the 
resulting  constrained  minimization  can  be  replaced  by  an 
unconstrained  convex  optimization.  Section  4  is  concerned  with 
conditions  for  existence  of  the  estimator;  these  turn  out  to  be 
mild  and  elegant. 

In  the  remaining  sections,  the  asymptotic  properties  of  the 
estimator  are  discussed.  Very  little  is  known  about  the 
asymptotics  of  any  roughness  penalty  methods  of  density  estimation 
beyond  the  consistency  (in  a  very  strong  norm,  under  quite 
restrictive  conditions)  proved  by  de  Montricher  (1981),  and  also 
the  results  for  a  related  estimator  proved  by  Reiss  (1981).  Some 
rates  of  convergence  have  been  obtained  by  KLonias  (1981);  though 
his  results  are  for  different  estimators  than  ours,  they  appear  to 
be  weaker  insofar  as  a  comparison  is  possible.  For  the  estimators 
of  this  paper  far  more  can  be  deduced.  Sections  5  to  8  below  lead 
to  proofs  of  consistency  with  rates  in  a  variety  of  different 
norms.  In  Section  6  a  linear  approximation  is  developed  which  is 
of  considerable  conceptual,  as  well  as  mathematical,  value.  The 
main  consistency  results  are  given  in  Section  8.  The  question  of 
asymptotic  normality  is  discussed  in  Section  9,  where  a  uniform 
approximation  of  the  estimator  by  a  Gaussian  process  is  given. 


2.  DEFINITION  AMD  MOTIVATION. 

Practically  all  density  estimation  methods  have  the  property 
that  the  limiting  estimate  as  the  amount  of  smoothing  decreases  is 
a  sum  of  spikes  at  the  observations,  but  what  happens  as  the  amount 
of  smoothing  increases  depends  on  exactly  what  method  is  being 
used.  It  turns  out  that  roughness  penalty  estimates  with  a 
suitable  penalty  functional  have  a  very  attractive  property,  best 
illustrated  by  considering  a  special  case.  Suppose  that  the 
penalty 

00 

f )  ■  /  {{d/dx)3log  f(x)}2dx 

—  00 

is  used.  Then,  in  the  sense  made  clear  in  Theorem  2.1  below,  the 
limiting  estimate  as  the  parameter  o  tends  to  infinity  will  be 
the  normal  density  with  the  same  mean  and  variance  as  the  data. 

Thus  as  a  varies  the  method  will  give  a  range  of  estimates  from 
the  'infinitely  rough'  sum  of  delta  functions  to  the  'infinitely 
smooth'  maximum  likelihood  normal  fit  to  the  data. 

Computational  and  mathematical  difficulties  aside,  this 
observation  presents  a  very  strong  case  for  the  use  for  density 
estimation  of  the  roughness  penalty  method  with  penalty  R^.  Since 
one  of  the  objects  of  non-parametric  methods  is  to  investigate  the 
effect  of  relaxing  parametric  assumptions,  it  seems  sensible  that 
the  limiting  case  of  a  non-parametric  density  estimate  should  be  a 
natural  parametric  estimate.  These  remarks  also  give  a  satisfying 
rationale  for  the  choice  of  roughness  functional.  Previously  this 


choice  has  been  made  either  in  an  ad  hoc  way  or  for  reasons  of 
mathematical  or  computational  convenience. 

Another  advantage  of  this  formulation  is  that  the  functional 
u  depends  only  on  the  logarithm  of  the  density  and  so  any  density 
estimates  obtained  will  automatically  be  positive.  This  remark  is 
further  elucidated  below  in  Section  3  which  deals  with  conditions 
under  which  the  functional  u  has  a  maximum.  It  should  also  be 
noted  that  the  log  density  is  itself  a  very  natural  quantity  to 
estimate,  particularly  if  the  estimates  are  used  to  estimate 
likelihood  functions,  or  for  non-par ametric  discriminant 
analysis.  Leonard  (1978)  has  used  a  Bayes  lain  approach  to  density 
estimation  in  which  a  stochastic  process  structure  is  placed  on  the 
log  density;  this  differs  from  our  approach  both  in  its  motivation 
and  in  some  of  its  detail,  but  is  nevertheless  another  example  of 
penalizing  for  roughness  in  the  logarithm  of  the  density. 

It  is  possible  to  define  other  roughness  penalties  according 

to  other  perceptions  of  ' infinitely  smooth'  exponential  families  of 

densities.  The  essential  property,  easily  checked  for  the  case 

discussed  above,  is  that  R(f)  should  be  zero  if  and  only  if  f 

is  in  the  required  family.  For  example,  for  data  on  the  half  line, 
00 

2 

Rtf)  *  /  {(log  f)"}  will  give  rise  to  exponential  densities 
0 

being  the  limiting  case,  while  on  the  circle 
Rtf)  *  /  {(log  f)"'  +  (log  f)'}2  will  have  as  the  infinitely 
smooth  family  the  von  Mises  densities  defined  by 


tK  .  (9)  “  exp{ie  cos(  0  -  0Q)}  , 


and  discussed  in  detail  by  Mardia  (1972). 

He  conclude  this  section  with  some  definitions  and  the  theorem 
which  gives  the  form  of  the  limiting  estimates.  Suppose  that  the 
domain  of  definition  of  the  estimates  and  the  set  in  which  the 
observations  lie  is  a  connected  set  ft  in  Rd.  A  space  such  as 
the  circle  is  considered  to  be  an  interval  in  R1  with  periodic 
boundary  conditions  placed  on  all  the  functions  considered;  the 
imposition  of  these  boundary  conditions  will  not  affect  any  of  the 
results  of  this  paper. 

Suppose  that  D  is  a  linear  differential  operator  of  the  form 
D(g)  -  l  c  d 


where  the  sum  is  over  all  vectors  a  of  non-negative  integers 
satisfying 

1  <  l  aL  <  m 


for  some  fixed  integer  m.  Assume  that  at  least  one  of  the 
coefficients  c(o)  for  £  =  m  is  non-zero.  The  results  of 

Sections  2  and  3  will  also  hold  where  the  coefficients  c(a) 
depend  on  x,  but  for  the  subsequent  work  it  is  assumed  that  there 
is  no  dependence  of  this  kind.  Note  that  there  is  no  constant  term 
in  the  definition  of  D;  D(g)  depends  only  on  derivatives  of  g. 
Define  the  non-negative  definite  bilinear  form  [  ,  ]  by 

[g1fg23  =  /  D(g1)D(g2)  ; 
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here,  and  subsequently,  unqualified  Integrals  are  taken  to  be  over 
ft  with  respect  to  Lebesgue  measure. 

Let  S  be  the  set  of  real  functions  g  on  ft  for  which 
(1)  The  (m  -  1)th  derivative( s)  of  g  exist  everywhere  and 
are  piecewise  differentiable. 

(ii)  [g,gj  <  “• 

(iii)  /  e9  <  ». 

Then  given  independent  identically  distributed  observations 

A 

Xj,...,Xn  on  Q,  our  estimate  g  of  the  log  density  underlying 
the  observations  will  be  the  solution,  if  it  exists,  of  the 
constrained  optimization  problem 

1  n  1 
raaxb  I  9<x4>  ■  ?  *lg,g] } 
i-i 

subject  to  g  in  S  and  /  e9  =  1.  The  substitution  X  *  2a/n 
has  been  made  to  simplify  some  of  the  mathematical  expressions 
below.  Our  estimate  f  of  the  density  itself  is  given  by 
f  »  exp(g). 

The  null  family  of  the  quadratic  form  [  ,  ]  will  be 

defined  to  the  collection  of  densities  f  on  Si  for  which 
[log  f,  log  fj  is  zero.  It  is  easily  shown  that  the  null  family 
is  an  exponential  family,  with  at  most  (m  -  1)  parameters. 

The  following  theorem  gives  a  sense  in  which  the  'infinitely 
smooth'  estimator  has  the  required  form. 


Theorem  2.1.  Provided  f  is  a  density  with  log  f  ^n  S, 
define 

1  n  1 

u).(f)  -  -  l  log  f(X.)  -  -  A[log  f,  log  f]  . 

A  n  ^ ^  l  i 

Let  f ^  be  the  maximum  likelihood  estimator  within  the  null  family 
based  on  suppose  the  data  are  such  that  exists. 

Then,  given  any  density  f  *  f^  with  log  f  ir±  S,  for  all 
sufficiently  large  A 

to^f^)  >  w^(f)  . 

Proof. 

If  f  is  not  in  the  null  family  then  as .(f)  as  A  «°  while 

A 

<0.  ( f  )  remains  fixed.  If  f  is  in  the  null  family  then 
A  09 

(^(f^)  -  ol>x< f )  -  ^  l  {log  fa>(Xi)  -  log  f(Xi)}  >  0 

by  the  definition  of  f  as  a  maximum  likelihood  estimate.  In 
either  case  the  conclusion  of  the  theorem  holds,  completing  the 
proof. 

Let  f^  denote  the  density  (if  it  exists)  which  maximizes 
then  it  would  be  of  interest  to  investigate  further  in  what 
senses  f^  -*■  f^  as  A  ♦  ®.  We  shall  not  consider  this  question 
further  in  this  paper. 

3.  THE  ESTIMATE  AS  AN  UNCONSTRAINED  OPTIMUM. 

One  of  the  reasons  that  roughness  penalty  density  estimates 
present  computational  and  mathematical  difficulties  is  their 


Implicit  definition  as  a  solution  to  a  constrained  optimization 


problem.  The  results  of  this  section  show  that  our  estimates  can 
be  found  as  the  unconstrained  minimum  of  a  strictly  convex 
functional  without  any  unknown  Lagrange  multipliers.  This 
observation  has  both  mathematical  and  computational  value,  and  is 
the  foundation  of  the  theoretical  work  given  below.  Computational 
aspects  will  not  be  considered  here,  but  the  result  makes  it 
possible  to  use  standard  methods  for  unconstrained  convex  problems 
to  find  the  estimator;  these  will  be  explored  in  subsequent  work. 
For  g  in  S  and  fixed  X  >  0,  write 

A0(g)  -  l  X[g,g]  -  -  l  gU^  (3.1) 

and 

Mg)  -  \  X(g,g]  +  /  e9  -  £  \  gtx^  .  (3#2) 

We  show  that  the  unconstrained  minimum  of  A(g)  is  identical  with 
the  constrained  minimum  of  Ag ( g) . 

Theorem  3.1.  The  function  g  in  S  minimizes  Ag(g)  over  g 
in  S  subject  to  /  e9  «*  1  if  and  only  if  g  minimizes  A(g) 
over  S. 

Remarks .  Note  first  that  the  theorem  says  nothing  about  the 
existence  of  g;  this  question  is  considered  in  Section  4  below. 
Our  reason  for  proving  this  result  firBt  is  that  we  shall  only  deal 
with  the  existence  question  under  conditions  on  ft  which  are  not 
needed  for  the  present  argument. 

It  is  easily  shown  that  A  is  a  strictly  convex  functional 
on  S,  as  defined  on  p.  154  of  Tapia  and  Thompson  (1978)  and 


hence,  by  their  Theorem  2  on  p.  160,  it  is  an  immediate  corollary 


of  the  present  theorem  that  g  is  unique  if  it  exists  at  all. 
Proof  of  the  Theorem. 

Given  g  in  S,  define  g*  ■  g  -  log  J  e^,  so  that 
f  exp(g*)  **  1.  Since  (  ,  3  only  involves  derivatives, 

(g*,g*3  =  [g,g] •  Therefore  it  follows  by  elementary  manipulations 
that  A(g*)  =>  A(g)  +  1  -  j  e9  +  log  J  eg  and  so  A(g*)  <  A(  g) 
with  equality  only  if  /  e9  =  1,  since  t  -  log  t  >  1  for  all 
t  >  0,  with  equality  only  if  t  =  1.  Therefore,  g  minimizes 
A(  g)  if  and  only  if  g  minimizes  A( g)  subject  to  e^  =  1;  but 
subject  to  /  e9  *  1,  A(g)  and  AQ(g)  +  1  are  identical,  and  so 
the  proof  of  the  theorem  is  complete. 

Note  that  the  proof  of  the  theorem  depends  crucially  on  the 
fact  that  the  penalizing  functional  involves  only  derivatives. 

4.  EXISTENCE  OF  THE  ESTIMATORS. 

A  discussion  of  the  existence  properties  of  the  Good-Gaskins 
estimators  is  given  in  Chapter  4  of  Tapia  and  Thompson  (1978), 
drawing  on  material  from  de  Montricher,  Tapia  and  Thompson 
(1975).  It  is  clear  from  that  work  both  that  the  estimators 
defined  in  this  paper  cannot  be  shown  to  exist  by  existing  work, 
and  also  that  the  question  of  existence  can  be  a  little  delicate. 

The  theorem  of  this  section  gives  a  natural  and  elegant 
condition  for  the  existence  of  the  estimates.  For  convenience  the 


-9- 


theorem  is  stated  for  the  special  case  of  univariate  bounded  ft, 
but  remarks  about  generalizations  are  made  below. 

Theorem  4. t.  Suppose  ft  is  a  bounded  interval  in  r\  possibly 
subject  to  periodic  boundary  conditions.  Given  observations 
X.,*..,Xn  in  ft,  the  functional  A  as  defined  in  (3.2)  above 
will  have  a  minlmlzer  in  S  if  there  is  a  maximum  likelihood 
estimator  based  on  Xj,...,xn  in  the  null  family. 

Remarks .  The  condition  of  the  existence  of  a  maximum  likelihood 
estimate  in  the  null  class  is,  of  course,  a  very  mild  one.  In  the 
case  where  ft  is  the  circle  and  the  null  class  is  the  von  Mises 
family,  for  example,  all  that  is  required  is  at  least  two  distinct 
data  points.  It  is  interesting  to  compare  the  existence  condition 
to  the  condition  given  in  a  different  context  for  the  existence  of 
the  estimator  considered  by  Silverman  ( 1978b) »  it  is  presumably 
possible  to  extend  the  technique  of  this  proof  to  deal  with 
penalized  likelihood  estimators  of  quantitites  other  them 
probability  density  functions. 

Proof . 

The  proof  depends  on  properties  of  reproducing  kernel  Hilbert 
spaces t  see,  for  example,  Oden  and  Reddy  (1976)  for  an  account  of 
these.  Given  g1  and  g2  in  S,  define 

<g1#g2>“  tg1'g21  +  /  gig2  (4,1) 

and 

V? 

"Vo  "  <VVo  • 
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Since  ft  is  bounded,  the  norm  I •  1^  will  make  S  a  reproducing 
kernel  Hilbert  space  equivalent  to  the  Sobolev  space  Hm(ft). 
Define  subspaces  S^  and  S2  of  S  by 
S  »  {g  in  S:  [g,g]  *  0  and  /  g  ■  0} 

S2  *  {g  in  Si  /  g  ■  0  and  <  g^)  «  0  for  all  g^  in  S^} 
If  p  is  the  largest  eigenvalue  less  than  one  of  the  reproducing 


kernel  in  S,  then,  given  g2  in  S2, 

.  ,2  -1  f  2 

lg2  '0  >  P  /  g 2 

and  hence 

[g2,g2l  >  (i  -  p 

Since  ft  is  bounded,  by  the  conditions  imposed  on  m  and  R,  the 
Sobolev  embedding  theorem  implies  that  the  sup  norm  is  continuous 


with  respect  to  l»lQ  and  hence  there  is  a  constant  c  such  that. 


for  g2  in  S2, 


suplg2l  <C(g2,g2)/z. 


(4.2) 


Define  spaces 

S0' 

s* 

and 

S1 

so 

■  {g 

in 

S: 

/  9 

S* 

■  {g 

in 

S: 

/  e 

S1 

“  (g 

in 

S: 

/  • 

9  =  1}  , 

g  =  1  and  [g,gl  =0}  . 

Define  a  functional  A*  by,  given  g  in  S,  defining  AQ  as  in 
(3.1), 

A*( g)  *  AQ(g)  +  i°g  /  e9  • 

It  is  easily  shown  that  A*(g)  ■  A*(g  +  c)  for  all  constants  c 
and  hence  that  the  mappings  g  g  -  log  J  eg  and  g  g  -  /  g 

set  up  an  A*  preserving  (1-1)  correspondence  between  SQ  and 


S*.  Since  Aq  and  A*  coincide  on  S* ,  it  follows  that  Aq 
will  have  a  minimum  in  S*  if  and  only  if  A*  has  a  minimum  in 
SQ.  A  minimum  of  Aq  on  S*  is,  of  course,  precisely  the 
estimate  we  are  seeking;  therefore  it  will  suffice  to  show  that 
there  is  a  minimum  of  A*  on  Sq. 

Given  any  g  in  Sq,  write  g  “  g^  +  g2  with  g^  in  S^ 
and  g2  in  S2«  Then 

A* (g)  -  ~  X[g,g3  +  log  /  e9  +  1  -  £  £  g(Xi> 

>  \  *tg2,g2l  -  l  l  g2(Xi)  (4.3) 

g 

+  log{exp( inf  g2>  J  e  1 }  +  1  -  ^  l  g^x^ 
using  the  fact  that  [g,g]  *  ( g2 , g2 ] •  Prom  (4.3)  it  follows  that 


»*lg)  >  j  i(g2.92)  -  j  I  gjIXj) 


+  inf  g2  +  A*(gi)  +  1 


(4.4) 


The  (1-1)  correspondence  defined  above  between  SQ  and  S*  gives 
an  A*  preserving  correspondence  between  S^  and  S*.  On  S*, 

A*  is  precisely  -  —  J  g(X1),  and  so  the  log  density  g  of  the 
maximum  likelihood  estimator  within  the  null  class  will  be  a 
minimizer  of  A*  in  S*;  it  follows  that  g  -  /  g  will  be  a 
minimizer  of  A*  in  Sj.  By  Cauchy-Schwarz  it  is  easily  shown 
that  A*  is  strictly  convex  on  Sj,  and  hence,  since  S1  is  a 
finite  dimensional  norrned  space  on  which  A*  has  a  minimum,  it  can 
be  shown  by  elementary  functional  analysis  that  there  exist 


2l 


constants  C^  >  0  and  C2  such  that,  for  in  S^, 

1  +  A*{g1)  >  C1 lg1 lQ  +  C2  .  (4.5) 

Next  we  consider  the  terms  of  (4.4)  involving  g2*  Using 
inequality  (4.2)  it  follows  that,  for  fixed  X  <  0,  there  exist 
positive  constants  C3  and  C4  such  that 

I  x  tg,gl  -  £  l  g2(x1)  +  inf  g2 


>  2  Xf92'925  “  2  8U£>  *92*  (4*6) 
>  C3lg2l2  -  C4lg2l0 


Substituting  (4.5)  and  (4.6)  into  (4.4)  gives,  for  g  in  Sq, 

»*<g>  >  VVo  *  C2  ♦  Wo  -  Wo 


(4.7) 


>C5(,9l'0+  ,92,0)  +C6 

for  suitable  constants  Cg  >  0  and  Cg,  by  elementary  algebra. 
Prom  (4.7),  using  the  triangle  inequality, 

A*(g)  >  Cglglp  +  .  (4.8) 

By  Theorem  5,  p.  162  of  Tapia  and  Thompson  (1978),  it  follows 
that  A*  has  a  minimizer  on  Sq,  completing  the  proof  of  the 


theorem. 

The  extension  of  the  theorem  to  the  case  where  ft  is  a 
bounded  multivariate  domain  is  straightforward  provided  that  the 
supremum  operator  is  continuous  with  respect  to  the  norm  I • 
on  Si  this  will  entail  conditions  on  ft  and  on  D,  for  details 
of  which  see  a  text  on  Sobolev  spaces.  An  extension  to  unbounded 
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Q  will  require  a  different  technique  of  proof  since  it  will  no 
longer  necessarily  be  the  case  that  /  g  <  “  for  g  in  S. 

5.  ASYMPTOTIC  PROPERTIES  -  PRELIMINARIES . 

In  the  remaining  sections,  the  consistency  and  other 
asymptotic  properties  of  the  estimators  are  studied.  There  has 
been  very  little  work  on  the  consistency  properties  of  maximum 
penalized  likelihood  density  estimates;  the  main  contribution  is 
the  paper  of  de  Montricher  (1981),  whose  results  do  not  seem  to  be 
directly  applicable  to  our  estimates  and  who  does  not  consider 
questions  of  rates  of  consistency  or  of  asymptotic  distributions. 
See  also  Klonias  (1981).  Consistency  of  a  related  class  of 
estimators  has  also  been  considered  by  Reiss  (1981).  The 
techniques  used  in  this  paper  are  more  akin  to  those  used  in 
several  papers  of  Wahba  (e.g.  Wahba,  1977)  though  some  care  is 
needed  because  the  functional  A,  though  unconstrained,  is  not 
quadratic. 

For  the  remainder  of  the  paper  attention  will  be  restricted  to 
the  case  where  S)  is  a  bounded  univariate  domain,  possibly  with 
periodic  end  conditions.  The  extension  to  any  particular 
multivariate  case  will  depend  on  the  solution  to  the  eigenvalue 
problem  of  the  differential  operator  D  in  the  domain  ft;  once 
that  is  solved  the  arguments  of  this  paper  will  go  through  easily. 

It  will  be  assumed  that  the  observations  X^,...,Xn  are 
independent  and  identically  distributed  with  density  fQ  on  ft. 


6 


In  order  to  nake  what  is  quite  an  involved  argument  a  little  more 
transparent,  rather  stringent  smoothness  conditions  will  be  placed 
on  fg,  but  it  should  be  stressed  that  appropriate  versions  of  the 
theorems  remain  true  under  much  milder  assumptions,  and  can  be 
obtained  by  very  similar  techniques.  These  extensions  are  left  to 
the  reader  to  investigate. 

Let  gQ  -  log  fg.  Assume  throughout  that,  in  the  terminology 
of  Wahba  (1977),  gQ  is  very  smooth,  in  other  words  that  gQ  and 
its  periodic  extension  have  2m  derivatives  on  ft  and  f  (gQ  ) 
is  finite.  In  particular,  assume  that  gg  is  bounded  above  and 
below.  It  will  be  convenient  to  prove  results  about  the 
convergence  of  the  estimates  of  the  log  density  rather  than  the 
density  itself,  but  only  elementary  calculus  is  needed  to  transform 
these  back  to  results  about  the  density. 

The  minimizer  of  the  functional  A  of  (3.2)  will  be  denoted 
by  g.  The  explicit  dependence  of  g  and  related  quantities  on 
the  sample  size,  the  values  of  the  observations,  and  the  smoothing 
parameter  will  usually  be  suppressed,  as  will  the  dependence  of  the 
smoothing  parameter  on  the  sample  size.  The  basic  strategem  of  the 
consistency  proof  is  first  to  study  the  properties  of  a  function 
g1  (defined  in  Section  6  below)  which  is  a  linear  approximation 
to  g,  and  then  to  show  that  g  and  g1  are  sufficiently  close 
to  allow  results  for  g  to  be  obtained.  It  should  be  kept  in  mind 
throughout  that  although  g1  has  desirable  properties  which  help 
one  understand  the  behavior  of  g,  the  definition  of  g1  depends 
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on  the  unknown  density  fg  j  therefore  g1  is  only  a  mathematical 
device  and  cannot/  in  contrast  to  g,  be  calculated  in  practice* 
The  remainder  of  this  section  consists  of  definitions  and 
lemmas  which  set  up  the  technical  machinery  needed  in  the 
subsequent  sections*  A  first  reader  may  find  it  easier  to  skip  to 
Section  6  and  then  to  refer  back  as  necessary.  A  more  casual 
reader  could  skip  straight  to  Section  8,  where  the  main  results  are 
given. 

'ttiree  different  norms  will  be  used  in  the  study  of 
consistency;  these  will  be  defined  for  g  in  S  as  follows: 

«gi?  -  /  g2fn  ; 


Igl^  -  sup  ig|  ; 

n 


(5.1) 


»gig  -  fg*gJ  +  /  g2f0  • 

Inner  products  <  >2  and  <  >g  are  defined  by 

^1^2^  ■  /  *1*2*0  1 
<  Vg2>S  “  Cg1'g2J  +  /  glg2f0  * 

Since  fQ  is  bounded  above  and  below  away  from  zero,  the  norm 

m 

I  I  is  equivalent  to  the  Sobolev  norm  on  S  *  H  (£))  .  Suppose 
s 

:  v  >  0}  is  a  sequence  of  orthonormal  eigenfunctions  with 
respect  to  the  density  fg  of  the  reproducing  kernal  of  (  ,  >g; 
l.e.,  for  a  sequence  of  eigenvalues  {X  } 


I 


f.  .  v"*.  - 


and 


<  >s  *  \  5ij 


<^i^j>2  "  6ij 

where  6^  is  the  Kronecker  delta .  By  standard  arguments  (see, 
for  example,  Riesz  and  Nagy,  1955),  is  identically  equal  to  1 

and  the  eigenvalues  satisfy 

1  -  A0  >  X,j  ^^2  ^  **'  * 


Define  the  sequence  p  by 


then  it  is  immediate  that 


C  - 1 » 


ut'V  "  pi\j  • 

When  expanding  elements  of  S  in  terms  of  the  eigenfunctions,  we 
shall  use  an  additional  subscript  (enclosed  in  brackets  if  any 
confusion  between  subscripted  functions  and  coefficients  may  arise) 
to  denote  a  coefficient.  Thus,  for  example,  we  shall  write 


’o  ■  l  Wv  *  Wo  *  Wl  *  ”• 

»  ■  J  V,  •  »(0>*0  *  *  “•  • 

Unqualified  sums  over  v  will  be  taken  to  be  over  the  range 
v  *  0  to  «.  The  asymptotic  behavior  of  the  eigenvalues  can  be 
deduced  using  the  following  lemma,  which  says  that  replacing  fg 
by  the  constant  function  does  not  affect  the  rate  of  convergence  of 
the  eigenvalues  to  zero. 

Lemma  5.1.  Suppose  that  the  sequences  X  and  $  are  defined  as 


above,  and  suppose  that  X*  are  the  eigenvalues  of  the  L2 


9 


0  22. 


orthonormal  eigenfunctions  of  the  Inner  product 

defined  in  (4.1)  above.  Then,  for  all  v  >  0,  putting 

*v ■  C  -  ’  si  »*  ■  »•"’  '  ’< 

Pv  inf  f0  <  PV  <  Pv  8UP  f0  •  (5*2> 

Proof. 

The  eigenvalue  p^  will  satisfy 

2 

Pv  “  inf{[g,g]  :  /  g^fQ  “  0,  j  *  0,...,v  -  1,  and  /  g  fQ  >  1} 

<  inf  {[g,  g]  :  /  g^  -  0  and  /  g2  -  (inf  f Q i"1 }  (5.3) 

since  the  infimum  is  over  a  smaller  set.  By  an  analogous  argument 
to  Riesz  and  Nagy  (1955)  p.  238,  it  follows  from  (5.3)  that,  taking 
the  supremum  over  elements  h^,...,h^1  of  S, 

p  <  sup  inf  {[g,g]  :  /  gh  -0,  j  *  0, ...,v-1 

g  3 

and  /  g2  *  (inf  fQ)  3} 

*  Unf  fQ)’1?*  • 

The  other  inequality  of  (5.2)  is  proved  similarly,  completing  the 
proof. 

Corollary.  There  exist  positive  constants  a  and  6  such  that, 
for  all  v  >  0, 

-2m 

X  ■  c  v  ,  u<c  <  $  . 
v  v  v 
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< **  r—  -.A-.pi *a, '  a v1^’ *•  >* * n UK** ^ 


Proof. 

By  standard  properties  of  Sobolev  spaces,  it  is  easily  shown  that 

the  L2  orthonormal  eigenfunctions  of  the  inner  product  <  , 

-2m. 

have  eigenvalues  A*  which  decay  exactly  at  rate  v  To  see 

this,  note  that  the  eigenfunction  expansion  is  precisely  a  Fourier 
series  expansion  and  that  the  eigenvalues  are  reciprocals  of 
polynomials  in  v  of  degree  2m.  Now  apply  Lemma  5.1  to  obtain  the 
rate  of  decay  of  Ay. 

2 

Given  any  g  in  L  (ft) ,  it  is  now  possible  to  give 
expressions  for  the  norms  of  (5.2)  in  terms  of  the  coefficients  of 
the  eigenfunction  expansion  of  g. 

Lemma  5.2.  Suppose  g  is  in  L  (0)  and  g v  *  J  9<l>vfc*  Then 

-  I  *1  .  (5-11 

”,2s  ■  i  ■  l5-5) 

and,  given  e  >  0,  there  exists  C£  >  0  such  that 


.  .2  „  „  r  1+e  2 
n gl  <  C  )  v  g 
od  Z  L  V 


(5.6) 


Proof 


Equations  (5.4)  and  (5.5)  are  immediate  from  the  definition  of  the 

eigenfunctions  and  eigenvalues.  The  equations  (5.6)  and  (5.7) 

follow  by  considering  Hilbert  scales  of  spaces  as  considered,  for 

example,  by  Oden  and  Reddy  (1976)  Chapter  4.  From  there  (p.  133) 

and  the  corollary  above  it  follows  that  £  v  will  be 

equivalent  to  the  norm  of  the  fractional  Sobolev  space  of  index 

j  (1  +  e),  and  hence,  by  the  Sobolev  embedding  theorem  (see  the 

remarks  on  p.  109  of  Oden  and  Reddy)  the  inequality  (5.6)  follows 

—2  2 

at  once.  Similarly  the  norm  £  g^  is  equivalent  to  the 
H2m(ft)  Sobolev  norm,  proving  the  last  part  of  the  lemma. 

The  next  lemma  gives  the  asymptotic  behaviour  of  certain 
functions  of  the  eigenvalues  which  will  occur  in  Section  6  below. 
The  notation  f^(A)  ~  f2(A)  as  *  *  0  is  taken  to  mean  that 
f  ^  ( A )/f 2 ( X)  and  f2(A)/f.j(A)  are  bounded  as  A  -*•  0.  The  lemma 
is  a  generalization  of  the  estimates  obtained  by  Wahba  (1977)  for 
her  A(  A)  and  G(  A) . 

Lemma  5.3.  Given  a  <  4ra  -  1,  as  A  ♦  0, 


-(a+1 )/2m 


L  *  * 

v-1  (1  +  Apy)2 

is  very  smooth,  as  A  ♦  0, 

given 

a 

r  p  2  2 

♦  p  g. 

if  b  - 

V 

1  L  v  Ov 

(1  +  Apv)2 

[  -  o(X-b/2"l 

if  b  > 
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b  <  2m, 

0 

0 
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Proof. 

The  first  part  is  proved  using  the  corollary  to  Lemma  5. 1  by 
approximating  the  sum  by  an  integral  in  the  manner  of  Wahba  (1977) 
p.  660;  values  for  the  implied  constants  can  be  obtained  by  more 
careful  analysis.  The  second  part  is  obtained  by  an  application  of 
the  dominated  convergence  theorem.  Note  that  making  milder 
smoothness  assumptions  on  gQ  will  affect  the  rates  in  the  second 
part  of  the  lemma.  Some  care  is  necessary  if  0  is  not  a  periodic 
domain  though;  see  Rice  and  Rosenblatt  (1981).  The  final  part  of 
this  section  concerns  the  sample  coefficients  of  the  empirical 
distribution.  Define  a  random  sequence  6  by 


*o-° 


0  «  n  1  $  (X  )  for  v  >  0  . 

v  i-1  v  1 


The  8^  depend  on  n,  but  this  dependence  will  not  be  expressed 
explicitly.  Some  properties  of  the  8y  ate  given  in  the  following 
lemma,  the  proof  of  which  follows  immediately  from  the  facts  that 
■  1  and  that  the  are  orthonorraal  with  respect  to  fQ. 

Lemma  5.4.  Given  n,  the  sequence  6  satisfies 


EB  =  0  for  all  r;  and 
r  -  - 

EBB  =  n  ^  6  for  r  and  s  >  1  . 
r  a  rs  -  - 

It  is  now  immediate,  by  the  classical  central  limit  theorem,  that 

V, 

for  each  v  >  0,  n z B  will  have,  asymptotically,  a  standard 
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* 


4 


I 


\ 


i 

1 


normal  distribution.  Indeed  it  is  possible  to  provide  a 
simultaneous  strong  approximation  of  the  sequence  ^  by  a 
sequence  of  normal  random  variables}  this  is  done  in  the  following 
lemma/  which  is  the  last  result  of  this  section* 

Lemma  5.5.  On  a  suitable  probability  space,  defining  the  sequence 
as  above,  there  exists  for  each  n  a  sequence  8^ 
of  Independent  N(0,1)  random  variables  such  that,  with 
probability  1 , 

1a  _i  _i  1/  „ 

lim  sup  n2(log  n)  sup  v  |n28  -  8  I  <  C(f  ) 

n+-  v>1  v  v  0 

where  C(fQ)  is  a  constant  depending  only  on  fQ. 

Proof. 

Write  8y  *  /  4>y( t)dF^( t)  where  PR  is  the  empirical  distribution 

function  of  the  observations  and  then  proceed  as  in  the  proof  of 

Propositions  1  and  2  of  Silverman  (1978a),  approximating 

n^2  ( f  -  F)  (t)  by  a  transformed  Brownian  bridge  w° {F( t) }  using 
n  n 

Theorem  3  of  Komlos,  Major  and  Tusnady  (1975).  The  8^  are  the 

coefficients  of  the  expansion  of  W^{F(t)}  in  terms  of  the 

n 

eigenfunctions  and  are  easily  shown  to  have  the  required 
structure.  Defining  Zn(t)  as  in  Silverman  (1978a)  it  follows  as 
on  p.  179  of  that  paper  that,  using  the  fact  that  J  4>vfg  is  zero 
for  v  >  1, 


n( log  n)  | 


,J/2 


8-81 
v  v 


J  u: 


sup 


|z  (t) 
n 


(5.8) 
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For  fixed  fg,  by  p.  134  of  Oden  and  Reddy  (1976)  there  are 


constants  C^t  C2(fg)  such  that 


/  Ifl  <c,(J  If/ ft 


rV  3 


(5.9) 


°2(t0>C/2"  ■  0('', 


by  the  corollary  to  Lemma  5.1.  To  complete  the  proof  substitute 
(5.9)  and  (2)  of  Silverman  (1978a)  into  (5.8). 


6.  THE  LINEAR  APPROXIMATION. 

In  this  section  the  linear  approximation  g^  to  g  will  be 

defined  and  studied;  the  question  of  the  closeness  of  the 

approximation  will  be  considered  in  Section  7  below.  The 

approximation  is  linear  in  the  sense  that  it  is  a  linear  function 

of  the  transformed  observations  i>  (X.)  and  that  it  is  the 

rv  i 

solution  of  a  certain  linear  system  in  a  Hilbert  space.  It  is  this 
linearity  which  leads  to  the  tractability  of  the  approximation. 
Define  a  quadratic  form  A1  for  g  in  S  by 

A^g)  =  j  X[g,g]  +  /  (1  +  (g-g  )  +  j  <9-g0>2}f0  "  J  I  9<V  (6.1) 

i=  1 

The  motivation  behind  the  definition  of  A^  is  that  it  is  the 
quadratic  form  which  has  second  order  contact  with  the  functional 


J1-. 


;  .  'I 

4 


■  i 


A  at  gQ .  Furthermore*  by  Proposition  17  and  Theorem  6  of 
Appendix  X  of  Tapia  and  Thompson  (1978),  the  functional  A^  is 
uniformly  convex  on  S  and  hence  has  a  unique  minimizer  g1 
in  S. 

Though  the  function  g^  is,  like  g,  defined  implicitly,  it 
is  straightforward  to  write  its  eigenfunction  expansion  explicitly. 
Up  to  a  constant,  we  have,  for  g  in  S, 


.  "•! 


Vg)  -  \  \  l  Pvg2v+  9(0)  +  -  l  g2 


1% 


n 


J  g  g.  -  y  J  n  g  6  (X . ) 
L  ’v*0v  vrv  i 

\H)  i=*1 


i  I  upv+  Dg2  -  I  (g0v  +  ev)gv 


(6.2) 


where  we  have  used  the  fact  that  n  ^  \  ^  ( X^ )  *  ^  ^°^-^ows 

from  (6.2)  that  the  coefficients  of  g1  satisfy 


*  4 
H 


gn  +  6 
J1  v  “  1  +  Xp 


(6.3) 


Studying  these  coefficients  gives  several  asymptotic  results  for 
g0  -  g1 .  Notice  that  the  form  (6.3)  can  immediately  be  decomposed 
into  its  systematic  and  random  components,  so  that 


E(g,  -  g„  )  ■  Xp  g„  (1  +  Xp  ) 
1v  Ov  v  Ov  v 


-1 


(6.4) 
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and 


g.  -  Eg.  ■  0  ( 1  +  Xp  ) 
’1v  1v  v  v 


-1 


(6.5) 

Consideration  of  (6.4)  and  (6.5)  as  X  ♦  0  shows  that  there  is,  as 
in  most  smoothing  problems,  a  trade  off  between  bias  and  random 
error. 

It  is  very  straightforward  to  apply  the  results  of  Section  5 
to  give  asymptotic  properties  of  g^,  and  this  is  done  in  the 
following  theorem.  Both  the  uniform  and  the  L  rates  of 
convergence  will  be  required  in  Section  while  the  Sobolev  rate 
is  included  for  its  own  interest. 

Theorem  6. 1 .  Defining  g1  as  above,  and  using  the  definitions  of 

Section  5  for  the  various  norms,  as  X  ♦  0  and  n  ♦  », 

2  -i  -l/2m  2 

Elg  -  gnC  ~  n  A  '  +  X 

I  \j  Zt 

,2  ..  -1  ~(2m+1 )/2m 

Elg1  -  gQlg  *  0(n  X  )  +  0(X) 


and,  given  6  >  0, 


,2  .  -6,  -1-1/m  A  (4m-1 )/2m  . 

Elg1  -  g0*w“  otx  (n  X  +  X  )} 


Proof. 


From  lemma  5.2  and  (6.3)  it  follows  that 


'*1  ’  90'2  “  l  (91v  "  »0v*  “  l 


„  (-Xp  g„  +  $  ) 
2  r  Kv’0v  v 


(t  +  Xpv)* 


and  hence,  by  Lemma  5.4 

E,g1  "  go'2  " 


,2  2  2 

X  p  gn  ,  “ 

v  Ov  J.  r 

2  +  n  ^ 


\H)  (1  +  Xpy) 


vl  (1  +  Xp  ) 
v 
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Substituting  the  bounds  given  by  Lemma  5.3  completes  the  proof  of 
the  first  part  of  Theorem  6.1.  The  second  and  third  parts  are 
proved  in  exactly  the  same  way,  by  first  applying  Lemma  5.2  to  give 
a  bound  for  the  appropriate  norm  of  (g^  -  g^)  in  terms  of  the 
coefficients  ~  and  then  applying  Lemmas  5.4  and  5.3. 

It  is  easy  to  deduce  conditions  under  which  g1  will  converge 
to  g0  in  various  norms.  In  addition  optimal  rates  of  convergence 
can  be  obtained;  these  will  be  discussed  further  after  it  has  been 
shown  that,  under  suitable  conditions,  Ig  -  g^  I  can  be  neglected 
relative  to  ig^  -  gQl. 

7.  CLOSENESS  OP  THE  LINEAR  APPROXIMATION  TO  THE  TRUE  MINIMIZER. 

In  this  section  the  closeness  of  g^  to  g  will  be 
considered.  The  arguments  are  a  little  involved,  mainly  because 
the  functional  A,  while  being  strictly  convex,  is  not  uniformly 
convex.  The  major  part  of  the  section  is  taken  up  with  the  proof 
of  the  following  lemma;  at  the  end  of  the  section  a  corresponding 
result  for  the  Sobolev  norm  is  discussed.  The  notation  0p 
denotes  an  order  of  magnitude  in  probability. 

Lemma  7.1.  Suppose  the  definitions  and  conventions  of  Sections  5 
and  6  are  used,  and  that  X  0  and  n  X  +  »  for  some  6  >  0 
as  n  ♦  «.  Then,  for  all  sufficiently  small  e  >  0,  as  n  ♦  «®, 

1^  -  ,,I  -0  (X-e{n-V,/m  ♦  . 

1  00  P 
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The  proof  of  the  lemma  proceeds  in  several  stages*  First  a  new 


approximation  gM  to  g  is  defined,  for  which  it  is  the  case  that 
the  uniform  convergence  of  gM  to  g q  will  imply  that  g  and 
gM  are  eventually  identical.  The  function  gM  is  the  minimiser 
of  a  functional  A^j  the  derivative  of  A^  at  g^  can  be  bounded 
in  such  a  way  as  to  enable  rates  of  stochastic  convergence  to  sero 
of  sup|gH  -  g1 |  to  be  obtained,  and  these  rates  are  easily  shown 
to  apply  to  sup|g  -  g1 |  also.  Choose  a  number  M  such  that 

suplg0!  +  2  <  M 

and  define  the  function  expM  by 


expM(x) 


{1  +  (x  +  M)  +  ^  (x  +  M)2}e”M 


for  x  <  -M  » 
otherwise. 


Define  a  functional  Am  on  S  by 


1  1  ** 

\,<g)  -  2  Ate»gJ  +  /  e*pM(g)  ”  “  l  g(Xi) 


and  let  gM  denote  the  minimizer  of  A^/  this  functional  is 
easily  shown  to  be  uniformly  convex  as  defined  by  Tapia  and 
Thompson  (1978),  and  hence  gM  exists  and  is  unique. 

In  the  remainder  of  the  argument,  derivatives  of  functionals 
will  be  used;  these  are  Gateaux  derivatives  as  discussed  by  Tapia 
and  Thompson  (1978).  Note  first  that  A^^  and  A  and  hence  their 
first  and  second  derivatives  agree  if  sup|g|  <  M;  it  is  easy  to 


show  that  these  derivatives  exist  everywhere.  By  the  strict 


4 

* 

14 


convexity  of  A  and  Ajj,  their  respective  minima  correspond 
exactly  to  zeros  of  A'  and  A^  and  hence  g  and  gN  will  be 
equal  if  sup|gMl  <  M. 

Defining  A^  as  in  <6.1),  since  g^  minimizes  Aj  it 
will  be  the  case  that  Aj(g^)  is  zero,  in  other  words,  for  all 


u  in  S, 


Alg^uJ  +  /  u{1  +  <g1-g(J)}f0  -  n_1  \  u(X1)  -  0  .  (7.1) 


Now,  substituting  (7.1),  we  will  have 


A'(g1)(u)  -  X{g1#u]  +  /  u  exp(g1 )  -  n  1  £  u(XA) 
-  /  u[exp(g1)  -  {1  +  (g1-90)}exp(g0)] 


(7.2) 


By  elementary  analysis,  there  exists  a  constant  C  such  that, 
provided  sup|g1  -  gQ|  <  1, 

|A’(g1)(u)|  <  C  /  |u|(g1  -  g0)2exp(gQ) 

C 

<  C  sup | u |  lg1  -  gQl2  i 

by  standard  functional  analysis,  using  the  operator  norms 
corresponding  to  those  defined  in  (5.1)  above. 


(7.3) 


^(g,)'.  <c  Ig,  -  g0i2  • 

Since,  under  the  assumption  suplg^  -  gg|  <  1,  it  follows 
a  fortiori  that  suplg^j  <  M  and  hence  A^g^)  ■  A'fg^)* 


(7.4) 


Mow  consider  the  operator  A^j(  g) .  Given  any  u  in  8, 


A£(g)(u,u)  -  A[u,u]  +  /  u2exp^(g)  >  A{u,u}  +  e  M  /  u2f0 


r  ..  ,  -am.  i  r.  -i-e.,  ,  -am.  l+e  z 

)  (Ap  +  e  )u  ■  >  v  (Ap+e  )v  u 
"  v  V  L  V  V 

V  V 


for  any  e  >  0.  By  elementary  analysis  it  follows  that 


v  „0  ,(1+e)/2M  r  1+e  2 

A-(gHu.u)  >  C^A  Iv  uv 


>  c  Al1+e)/2m(sup|u|)2 

Hf  t 


(7.5) 


for  suitable  positive  constants  c”  and  C  ,  by  Lemma  5.2. 

M#0  N| € 

Now  set  Ujj  »  g1  -  gM.  Apply  Taylor’s  theorem  to  the  function 
( af  t)  A^(gM  +  tuMHuM)  to  obtain,  for  some  0,  0  <  0  <  1, 


AA(g1)(V  *  *M(gM  +  * 

Combining  (7.4),  (7.5)  and  (7.6)  it  follows  that 


(7.6) 


CM.e*(1+E)/2m<8UpluM,)2  *  C  8Up|um'  lgl  *  Va 


so  that,  for  a  suitable  constant  C^,  provided  suplg^  -  gQ|  <  1, 


Buplu^jl  <  CjA 


-( 1+e)/2m  _  2 

,91  90  *2  * 


(7.7) 


Under  the  conditions  stated  in  Theorem  7.1,  it  follows  from 
Theorem  6.1  that 

sup|g1  -  g0l  ■*■  0  in  probability  (7.8) 

and  hence 
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■A 


'll 

t  I 

*41 


P( sup 1 9^  -  gfl|  <  1)  1  n  ♦ 


(7.9) 


Again  from  Theorem  6.1  we  have 


(7.10) 


'  Vj  -  "pi""' *'1/2"  *  »2i 


Combining  (7.7),  (7.9)  and  (7.10)  givea,  for  all  e  >  0, 


sup|g1  -  g^l  -  0  (X  e{n“1x“1/a  +  X(4"“1)/2*})  .  (7.11) 


In  particular  it  follows  that  suplg.  -  g  |  ♦0  in  probability} 

v  n 

combined  with  (7.8)  this  implies  that  sup | g^  -  9  |  ♦  o  in 
probability,  from  which  it  is  immediate  that 

P(sup|gMl  <  M)  ♦  1  as  n+»; 

by  the  remarks  made  near  the  beginning  of  the  proof  this  implies 


P(«k  *  9)  ♦  0  (7.12) 

and  hence  the  proof  of  the  lemma  follows  from  (7.11). 

No  attempt  will  be  made  to  obtain  a  finer  bound  for 
Ig  -  9^*2  since  the  bound  obtained  from  Lemma  7.1  will  suffice. 

A  bound  for  the  difference  between  g  and  g^  in  the  Sobolev  norm 
is  given  in  the  following  lemma. 

Lemma  7.2.  (Aider  the  same  conditions  as  Lemma  7.1 


,  .  .  -1 ,-(2m+1 )/2m 

"9  "  «s  -  0p(n  X  +  X) 


as  n  tends  to  infinity. 
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Proof 


The  argument  is  very  similar  to  the  proof  of  Lemma  7.1.  Since  the 
sup  operator  is  continuous  in  the  Sobolev  norm  it  follows  from 
(7.4)  that,  for  a  suitable  constant  Cg,  provided 
sup|g1  -  gQ|  <  1, 

■vv's  <  v».  -  ’o1!  • 

By  an  argument  similar  to  that  used  to  demonstrate  (7.5) ,  using 

Lemma  5.2,  for  all  g  and  u  in  S  we  have,  for  a  suitable 
S 

constant  C  , 
n 

A£(g)(u,u)  >  c®X  lulg  . 

It  can  now  be  deduced  that,  provided  supl g1  "Sol  <  1 

'Vs  •  0(A_1)  '*1  -  V*  ; 

the  remainder  of  the  proof,  making  use  of  (7.12),  exactly  parallels 
that  of  Lemma  7.2. 

8.  THE  MAIN  CONSISTENCY  RESULTS. 

It  is  now  possible  to  state  and  prove  conditions  under  which 
g  is  in  various  senses  a  consistent  estimator  of  gQ  and  to  give 
rates  for  this  consistency.  These  are  given  in  the  following 
theorem,  it  should  be  stressed  again  that  the  conditions, 
particularly  those  placed  on  the  smoothness  of  gQ,  can  be 


weakened  considerably  by  extending  the  arguments  used  in  Sections 
5,  6  and  7  above. 


Theorem  8.1.  Suppose  Cl  is  a  bounded  univariate  domain,  possibly 

with  periodic  end  conditions.  Suppose  the  true  density  fQ  on  Q 

is  bounded  above  and  below  away  from  zero?  let  gQ  ”  log  fQ. 

Suppose  the  roughness  penalty  [g,g]  is  defined,  using  a 

differential  operator  of  order  m,  as  in  Section  2  above,  and  that 

the  log  density  estimate  g  is  defined  as  in  Section  3  above, 

baaed  on  independent  identically  distributed  observations 

( 2  m)  2 

X1,...,Xn  from  fQ.  Suppose  that  f  (g^  )  <  »  and  that 

(2m—1 )  ^ 

9q  is  continuous  on  the  periodic  extension  of  ft. 

Suppose  throughout  that  the  smoothing  parameter  X  satisfies, 

for  some  6  >  0, 

H|.  j 

X  ♦  0  and  n  X  ®  as  n+<». 

Then  g  is  uniformly  consistent  as  an  estimator  of  gg  and  in 
addition,  for  all  e  >  0, 


sup  |g  -  g  |2  -  Oo{x’e(n'1x'1/n  + 

a  °  P 

If,  in  addition,  n(2/3)®"6^  •  as  n  +  » 

then,  as  n  ♦  », 


x(4“"1)/2a)}  . 

for  some  6  >  0, 


I  - 


V2fo  ’ 


_  ,  -1  -1/2ra  A  2. 
O  (n  X  +  X  ) 

P 


and  this  rate  la  exactly  attained.  Defining  the  Sobolev  norm 
•gig  ■  [g»gl  +  /  g2fQ»  provided  X  0  and  nj^2m+1)/2m  +  „  aa 
n  -*•  «®,  the  estimator  g  is  consistent  for  gQ  in  Sobolev  norm. 


and,  as  n  +  •, 


14  •  Vs 


0  (n-Y,2”1,/2*)  ♦  o  (X) 
P  P 


The  proofs  of  all  the  parts  of  the  theorem  are  obtained  by 

combining  the  relevant  part  of  Theorem  6. 1  with  either  Lemma  7.1 

2 

(for  the  L  and  uniform  consistency  and  rates)  or  Lemma  7.2  (for 
the  Sobolev  consistency  and  rates) .  In  all  cases  it  is  the 
lg^  -  gQl  part  which  dominates,  the  term  Ig  -  I  being 
negligible.  The  details  are  straightforward  and  are  therefore 
omitted. 

It  is  possible  to  investigate  optimal  rates  of  consistency  and 
the  corresponding  rate  of  convergence  of  X  to  zero.  For  mean 
square  convergence,  corresponding  to  convergence  of  the  estimated 
density  to  the  true  density  in  the  KUllback-Leibler  information 
distance,  the  (exact)  optimal  rate  of  consistency  is  easily  shorn 
to  be  0(n“4n^^4oH' 1  * )  attained  when  X  -  n  This  rate 

of  convergence  near  to  0( n-1 )  is  of  course  a  consequence  of  the 
strong  smoothness  conditions  placed  on  gQ.  The  corresponding 
results  for  the  other  norms  are  left  to  the  reader  to 
investigate.  It  is  interesting  to  note  that  the  optimal  rate  for 
X  for  good  estimation  in,  for  example,  the  Sobolev  norm  will  not 
be  the  same  as  the  optimal  rate  for  mean  square  consistency.  Thus 
one  will  not  necessarily  obtain  good  estimates  of  the  derivatives 


of  fg  by  seeking  good  estimates  of  f  itself,  a  point  relevant 


to  Silverman  (1980)  and  the  subsequent  rejoinder  of  Good  and 
Gaskins  (1980). 

The  question  of  strong  consistency,  as  considered  for  slightly 
different  estimators  by  de  Montricher  (1981),  is  not  considered  in 
this  paper,  though  it  seems  intuitively  clear  that  analogous 
results  to  Theorem  8.1,  possibly  with  slightly  slower  rates,  should 
be  provable  by  suitable  techniques.  The  question  of  the  asympotic 
normality  of  the  estimates  is  considered  in  Section  9  below. 

9.  APPROXIMATION  BY  A  GROSS IAN  PROCESS. 

To  the  author's  knowledge,  roughness  penalty  density 
estimators  are  the  only  density  estimators  that  have  not  been  shown 
under  suitaible  conditions  to  be  asymptotically  normal.  It  turns 
out  to  be  possible  not  only  to  show  that  the  estimators  discussed 
in  this  paper  are  pointwise  asymptotically  normal  but  also  to  give 
a  rate  of  approximation  to  the  estimators,  suitably  normalized,  by 
a  Gaussian  process.  An  approximation  of  this  kind  is  more  in 
keeping  with  the  modem  theory  of  density  estimators  and  opens  the 
way  to  proving  results  such  as  those  of  Bickel  and  Rosenblatt 
(1973)  and  Silverman  (1976)  on  the  asymptotic  behavior  of  certain 
functionals  of  the  estimates.  It  shows  that  the  joint  distribution 
of  the  value  of  the  estimator  at  several  points  is  asymptotically 
multivariate  normal  and  also  gives  a  rate  of  convergence  to  this 
normal  limit. 
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Before  considering  the  approximation  itself,  it  is  convenient 
to  consider  some  preliminaries.  The  notation  and  conventions  of 
this  section  will  be,  except  where  otherwise  stated,  the  same  as  in 
Sections  5  to  8  above.  In  particular  all  the  conditions  stated 
near  the  beginning  of  Section  5  will  be  assumed  to  hold.  Defining 
the  eigenfunctions  as  in  Section  5,  define  the  function 

R^  on  S5  x  !J  by 


Rx(s,t) 


r  ♦v(9)*v(t) 

l (1  +  x<v 


The  function  is  the  reproducing  kernel  with  respect  to  the 

density  fQ  corresponding  to  the  inner  product 
A[g1,g2]  +  /  <3^2f 0  on  S  an<*  1,8  also  the  Green>  8  function  of  a 
certain  differential  operator;  for  the  connections,  see  a  modern 
text  on  differential  equations. 


Define  a  function  m^(t)  to  be  Eg^(t),  so  that  the 
coefficients  of  ra  satisfy 


"xv'  W1  *  Xpv’ 


It  follows  that 


m  (s)  -  /  R  (s,t)g.(t)f(t)dt 
A  n  A 


(9.1) 


so  that  m  can  be  seen  to  be,  in  a  certain  sense,  a  smoothed 

A 

version  of  gQ.  Define  the  function  r^  by 


rx(..C) 


V  v  v 
L  2 

v-1  (1  +  Xpv) 


f 
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it  can  be  shown  easily  that 


r^(s,t)  -  /  R^(s,u)R^(t,u) fQ(u)du  -  1 


and  again  1  +  r^  is  the  Green's  function  of  a  certain 

differential  operator.  In  addition/  by  methods  similar  to  those 

used  in  the  proof  of  Theorem  6.1,  we  have,  for  all  s, 

-1/2m 

r^(s,s)  -  X  as  X  ♦  0  .  (9.2) 

It  is  now  possible  to  state  and  prove  the  main  result  of  this 
section. 

Theorem  9.1.  Suppose  that  the  conditions  of  Theorem  8. 1  hold.  For 
each  n,  on  a  suitable  probability  space  there  exists  a  Gaussian 
process  y^(s)  with  mean  zero  and  covariance  function  r^(s,t) 
such  that 


g(s) 


JA 

m^(s)  +  n  2  y^(s) 


+  err  ,(s) 
n,  X 


where  the  functions  m^  and  r^  are  as  defined  above  and,  given 
6  >  0,  the  approximation  error  err  is 

III  A 

_  ,,-6.  -1,-1/m,  .  -(4m-1)/2mv, 

0  (X  (n  X  log  n  +  X  )} 

P 


uniformly  over  s  ^n  R  as  X  ♦  0  and  n  ♦  “. 

Proof. 

The  result  is  a  consequence  of  Lemma  5.5  on  the  approximation  of 
the  by  normal  random  variables.  Note  that  the  distribution  of 

y  does  not  depend  on  n,  and  so,  as  in  Silverman  (1976)  and 

A 

(1978a),  theorems  about  its  behavior  as  X  -*•  0  can  be  combined 


with  Theorem  9. 1  to  provide  results  for  9  under  transparent 

conditions  connecting  X  and  n . 

Define  the  &  as  in  Lemma  5.5  and  define  y,  by 
v  'A 


V1 


-  w> 

Vi  ”  ♦  X’v> 


It  is  easily  verified  that  y^  is  a  well  defined  random  element 
of  S  and  is  a  Gaussian  process  with  Ey^(s)  3  0  and 
cov{y^(s)  ,  y^(t) }  *  r^(s,t).  Using  (6.3)  it  follows  that  the  error 
process  will  satisfy 

-1/  „ 

«  (  &  -  n  /2  3  )  4.  ( s) 

errn,Xla)  “  I  +  9(3)  -  9,0  0.3) 

V=1  V 

Using  Lemma  5.2,  the  supremum  over  s  of  the  sum  in  (9.3)  is, 
given  e  >  0,  dominated  by  a  constant  multiple  of 


{I  v1' +C(BV  -  nJ/2  pv)2(1  +  Xpv)'2}V2  (9.4) 

Now  substitute  the  bound  of  Lemma  5. 5  on  6  -  n  z  0  to  show 

v  v 

that  (9.4)  is,  with  probability  1, 

fr  3+e,,  .  .  , -2  i*/>  .  -1,  ,  _ ,  -(4+e)/4m  -1  . 

v  (1  +  Xp^)  }  ^0(n  log  n)  ■  0(X  n  log  n) 

by  Lemma  5.3.  Substituting  this  bound  and  (7.11)  into  (9.3) 


completes  the  proof  of  Theorem  9.1 


It  is  easy  to  use  (9.2)  and  theorem  9.1  to  construct 
conditions  under  which  r^(t,t)  ^  (g(t)  -  ra^(t)}  is  asymptotically 
standard  normal  and  this  is  left  to  the  reader  to  investigate. 

10.  DISCUSSION  AND  ACKNOWLEDGMENTS. 

There  are  of  course  numerous  questions  still  unanswered  about 
roughness  penalty  density  estimates.  Apart  from  those  technical 
questions  raised  in  the  body  of  this  paper  there  are  several 
important  points  of  a  practical  nature  which  have  not  been 
discussed.  Some  heuristic  calculations  done  by  the  author  and  not 
included  here  suggest  that  the  estimates  of  this  paper  may  provide 
a  solution  to  the  problem  of  finding  estimates  which  are 
automatically  adaptive  to  the  tails  of  the  distribution;  most 
existing  methods  either  under  or  oversraooth  the  tails  relativ  .  to 
the  main  part  of  the  data.  Another  important  problem  is  the  design 
of  efficient  and  well  understood  data-based  methods  for  choosing 
the  smoothing  parameter,  though  it  should  be  the  case  that 
techniques  from  other  density  estimation  methods  can  be  adapted  for 
use  here.  Finally  it  is  of  course  important  to  have  good  computer 
algorithms  for  finding  the  estimates! 

The  author  gratefully  acknowledges  useful  discussions  with 
Dennis  Cox,  Vassilios  KLonias,  Tom  Leonard,  Finbarr  O'Suileabhain, 
Grace  Wahba  and  James  Wendelberger . 
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